Abstract

Neural machine translation (NMT) models typically operate with a fixed
vocabulary, but translation is an open-vocabulary problem. Previous work
addresses the translation of out-of-vocabulary words by backing off to a
dictionary. In this paper, we introduce a simpler and more effective
approach, making the NMT model capable of open-vocabulary translation by
encoding rare and unknown words as sequences of subword units. This is
based on the intuition that various word classes are translatable via
smaller units than words, for instance names (via character copying or
transliteration), compounds (via compositional translation), and cognates
and loanwords (via phonological and morphological transformations). We
discuss the suitability of different word segmentation techniques,
including simple character n-gram models and a segmentation based on the
byte pair encoding compression algorithm, and empirically show that
subword models improve over a back-off dictionary baseline for the WMT 15
translation tasks English→German and English→Russian by up to 1.1 and
1.3 BLEU, respectively.

1 Introduction 

Neural machine translation has recently shown impressive results
(Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al.,
2015). However, the translation of rare words is an open problem. The
vocabulary of neural models is typically limited to 30000-50000 words,
but translation is an open-vocabulary problem, and especially for
languages with productive word formation processes such as agglutination
and compounding, translation models require mechanisms that go below the
word level. As an example, consider compounds such as the German
Abwasser|behandlungs|anlage ‘sewage water treatment plant’, for which a
segmented, variable-length representation is intuitively more appealing
than encoding the word as a fixed-length vector.

For word-level NMT models, the translation of out-of-vocabulary words has
been addressed through a back-off to a dictionary look-up (Jean et al.,
2015; Luong et al., 2015b). We note that such techniques make assumptions
that often do not hold true in practice. For instance, there is not always
a 1-to-1 correspondence between source and target words because of
variance in the degree of morphological synthesis between languages, like
in our introductory compounding example. Also, word-level models are
unable to translate or generate unseen words. Copying unknown words into
the target text, as done by Jean et al. (2015) and Luong et al. (2015b),
is a reasonable strategy for names, but morphological changes and
transliteration are often required, especially if alphabets differ.

We investigate NMT models that operate on the level of subword units. Our
main goal is to model open-vocabulary translation in the NMT network
itself, without requiring a back-off model for rare words. In addition to
making the translation process simpler, we also find that the subword
models achieve better accuracy for the translation of rare words than
large-vocabulary models and back-off dictionaries, and are able to
productively generate new words that were not seen at training time. Our
analysis shows that the neural networks are able to learn compounding and
transliteration from subword representations.

This paper has two main contributions:

• We show that open-vocabulary neural machine translation is possible by
encoding (rare) words via subword units. We find our architecture simpler
and more effective than using large vocabularies and back-off
dictionaries (Jean et al., 2015; Luong et al., 2015b).

• We adapt byte pair encoding (BPE) (Gage, 1994), a compression
algorithm, to the task of word segmentation. BPE allows for the
representation of an open vocabulary through a fixed-size vocabulary of
variable-length character sequences, making it a very suitable word
segmentation strategy for neural network models.

2 Neural Machine Translation 

We follow the neural machine translation architecture by Bahdanau et al.
(2015), which we will briefly summarize here. However, we note that our
approach is not specific to this architecture.

The neural machine translation system is implemented as an
encoder-decoder network with recurrent neural networks.

The encoder is a bidirectional neural network with gated recurrent units
(Cho et al., 2014) that reads an input sequence $x = (x_1, \ldots, x_m)$
and calculates a forward sequence of hidden states
$(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_m)$, and a backward
sequence $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_m)$. The hidden
states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are concatenated
to obtain the annotation vector $h_j$.

The decoder is a recurrent neural network that predicts a target sequence
$y = (y_1, \ldots, y_n)$. Each word $y_i$ is predicted based on a
recurrent hidden state $s_i$, the previously predicted word $y_{i-1}$,
and a context vector $c_i$. $c_i$ is computed as a weighted sum of the
annotations $h_j$. The weight of each annotation $h_j$ is computed
through an alignment model $\alpha_{ij}$, which models the probability
that $y_i$ is aligned to $x_j$. The alignment model is a single-layer
feedforward neural network that is learned jointly with the rest of the
network through backpropagation.
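
As an illustration of the weighted sum described above, the following is a minimal Python sketch, not the Groundhog implementation. It assumes the raw alignment scores are already given (in Bahdanau et al. (2015) they come from the feedforward alignment model) and normalizes them with a softmax; the function names `softmax` and `context_vector` are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alignment_scores, annotations):
    # alignment_scores: one raw score per source position j (assumed given).
    # annotations: one annotation vector h_j per source position j.
    weights = softmax(alignment_scores)
    dim = len(annotations[0])
    return [sum(w * h[d] for w, h in zip(weights, annotations))
            for d in range(dim)]

# Two source positions with equal scores contribute equally.
c = context_vector([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(c)
```

With a subword segmentation, each annotation corresponds to a subword unit rather than a whole word, so the weights can be distributed over the parts of a long compound.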

A detailed description can be found in (Bahdanau et al., 2015). Training
is performed on a parallel corpus with stochastic gradient descent. For
translation, a beam search with small beam size is employed.

3 Subword Translation 

The main motivation behind this paper is that the translation of some
words is transparent in that they are translatable by a competent
translator even if they are novel to him or her, based on a translation
of known subword units such as morphemes or phonemes. Word categories
whose translation is potentially transparent include:

• named entities. Between languages that share an alphabet, names can
often be copied from source to target text. Transcription or
transliteration may be required, especially if the alphabets or
syllabaries differ. Example:
Barack Obama (English; German)
Барак Обама (Russian)
バラク・オバマ (ba-ra-ku o-ba-ma) (Japanese)

• cognates and loanwords. Cognates and loanwords with a common origin can
differ in regular ways between languages, so that character-level
translation rules are sufficient (Tiedemann, 2012). Example:
claustrophobia (English)
Klaustrophobie (German)
Клаустрофобия (Klaustrofobia) (Russian)

• morphologically complex words. Words containing multiple morphemes, for
instance formed via compounding, affixation, or inflection, may be
translatable by translating the morphemes separately. Example:
solar system (English)
Sonnensystem (Sonne + System) (German)
Naprendszer (Nap + Rendszer) (Hungarian)

In an analysis of 100 rare tokens (not among the 50000 most frequent
types) in our German training data,¹ the majority of tokens are
potentially translatable from English through smaller units. We find 56
compounds, 21 names, 6 loanwords with a common origin
(emancipate→emanzipieren), 5 cases of transparent affixation (sweetish
‘sweet’ + ‘-ish’ → süßlich ‘süß’ + ‘-lich’), 1 number and 1 computer
language identifier.

Our hypothesis is that a segmentation of rare words into appropriate
subword units is sufficient to allow for the neural translation network
to learn transparent translations, and to generalize this knowledge to
translate and produce unseen words.² We provide empirical support for
this hypothesis in Sections 4 and 5. First, we discuss different subword
representations.

¹ Primarily parliamentary proceedings and web crawl data.

² Not every segmentation we produce is transparent. While we expect no
performance benefit from opaque segmentations, i.e. segmentations where
the units cannot be translated independently, our NMT models show
robustness towards oversplitting.

3.1 Related Work 

For Statistical Machine Translation (SMT), the 
translation of unknown words has been the subject 
of intensive research. 

A large proportion of unknown words are names, which can just be copied
into the target text if both languages share an alphabet. If alphabets
differ, transliteration is required (Durrani et al., 2014).
Character-based translation has also been investigated with phrase-based
models, which proved especially successful for closely related languages
(Vilar et al., 2007; Tiedemann, 2009; Neubig et al., 2012).

The segmentation of morphologically complex words such as compounds is
widely used for SMT, and various algorithms for morpheme segmentation
have been investigated (Nießen and Ney, 2000; Koehn and Knight, 2003;
Virpioja et al., 2007; Stallard et al., 2012). Segmentation algorithms
commonly used for phrase-based SMT tend to be conservative in their
splitting decisions, whereas we aim for an aggressive segmentation that
allows for open-vocabulary translation with a compact network vocabulary,
and without having to resort to back-off dictionaries.

The best choice of subword units may be task-specific. For speech
recognition, phone-level language models have been used (Bazzi and Glass,
2000). Mikolov et al. (2012) investigate subword language models, and
propose to use syllables. For multilingual segmentation tasks,
multilingual algorithms have been proposed (Snyder and Barzilay, 2008).
We find these intriguing, but inapplicable at test time.

Various techniques have been proposed to produce fixed-length continuous
word vectors based on characters or morphemes (Luong et al., 2013; Botha
and Blunsom, 2014; Ling et al., 2015a; Kim et al., 2015). An effort to
apply such techniques to NMT, parallel to ours, has found no significant
improvement over word-based approaches (Ling et al., 2015b). One
technical difference from our work is that the attention mechanism still
operates on the level of words in the model by Ling et al. (2015b), and
that the representation of each word is fixed-length. We expect that the
attention mechanism benefits from our variable-length representation: the
network can learn to place attention on different subword units at each
step. Recall our introductory example Abwasserbehandlungsanlage, for
which a subword segmentation avoids the information bottleneck of a
fixed-length representation.

Neural machine translation differs from phrase-based methods in that
there are strong incentives to minimize the vocabulary size of neural
models to increase time and space efficiency, and to allow for
translation without back-off models. At the same time, we also want a
compact representation of the text itself, since an increase in text
length reduces efficiency and increases the distances over which neural
models need to pass information.

A simple method to manipulate the trade-off between vocabulary size and
text size is to use shortlists of unsegmented words, using subword units
only for rare words. As an alternative, we propose a segmentation
algorithm based on byte pair encoding (BPE), which lets us learn a
vocabulary that provides a good compression rate of the text.

3.2 Byte Pair Encoding (BPE) 

Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression
technique that iteratively replaces the most frequent pair of bytes in a
sequence with a single, unused byte. We adapt this algorithm for word
segmentation. Instead of merging frequent pairs of bytes, we merge
characters or character sequences.

Firstly, we initialize the symbol vocabulary with the character
vocabulary, and represent each word as a sequence of characters, plus a
special end-of-word symbol ‘·’, which allows us to restore the original
tokenization after translation. We iteratively count all symbol pairs and
replace each occurrence of the most frequent pair (‘A’, ‘B’) with a new
symbol ‘AB’. Each merge operation produces a new symbol which represents
a character n-gram. Frequent character n-grams (or whole words) are
eventually merged into a single symbol, thus BPE requires no shortlist.
The final symbol vocabulary size is equal to the size of the initial
vocabulary, plus the number of merge operations - the latter is the only
hyperparameter of the algorithm.

For efficiency, we do not consider pairs that cross word boundaries. The
algorithm can thus be run on the dictionary extracted from a text, with
each word being weighted by its frequency. A minimal Python
implementation is shown in Algorithm 1. In practice, we increase
efficiency by indexing all pairs, and updating data structures
incrementally.

Algorithm 1 Learn BPE operations

    import re, collections

    def get_stats(vocab):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols)-1):
                pairs[symbols[i],symbols[i+1]] += freq
        return pairs

    def merge_vocab(pair, v_in):
        # Replace each occurrence of the pair with the merged symbol.
        v_out = {}
        bigram = re.escape(' '.join(pair))
        # Match the bigram only at whitespace-delimited symbol boundaries.
        p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        for word in v_in:
            w_out = p.sub(''.join(pair), word)
            v_out[w_out] = v_in[word]
        return v_out

    vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,
             'n e w e s t </w>':6, 'w i d e s t </w>':3}
    num_merges = 10
    for i in range(num_merges):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(best)

    r ·   →  r·
    l o   →  lo
    lo w  →  low
    e r·  →  er·

Figure 1: BPE merge operations learned from dictionary {‘low’, ‘lowest’,
‘newer’, ‘wider’}.

The main difference to other compression algorithms, such as Huffman
encoding, which have been proposed to produce a variable-length encoding
of words for NMT (Chitnis and DeNero, 2015), is that our symbol sequences
are still interpretable as subword units, and that the network can
generalize to translate and produce new words (unseen at training time)
on the basis of these subword units.

Figure 1 shows a toy example of learned BPE operations. At test time, we
first split words into sequences of characters, then apply the learned
operations to merge the characters into larger, known symbols. This is
applicable to any word, and allows for open-vocabulary networks with
fixed symbol vocabularies.³ In our example, the OOV ‘lower’ would be
segmented into ‘low er·’.
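
The test-time procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than the released implementation: the function name `apply_bpe` is hypothetical, the end-of-word symbol is written `</w>` as in Algorithm 1, and the merge list is the one from Figure 1, replayed in the order in which it was learned.

```python
def apply_bpe(word, merges, end_marker="</w>"):
    """Segment one word by replaying learned merges in learning order."""
    # Start from single characters plus the end-of-word symbol.
    symbols = list(word) + [end_marker]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge every adjacent occurrence of (left, right).
            if (i + 1 < len(symbols)
                    and symbols[i] == left and symbols[i + 1] == right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Merge operations from Figure 1, with '</w>' standing in for the marker.
merges = [("r", "</w>"), ("l", "o"), ("lo", "w"), ("e", "r</w>")]
print(apply_bpe("lower", merges))  # ['low', 'er</w>']
```

Because segmentation starts from single characters, any word made of known characters receives a segmentation, whether or not it occurred in the training text.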

³ The only symbols that will be unknown at test time are unknown
characters, or symbols of which all occurrences in the training text have
been merged into larger symbols, like ‘safeguar’, which has all
occurrences in our training text merged into ‘safeguard’. We observed no
such symbols at test time, but the issue could be easily solved by
recursively reversing specific merges until all symbols are known.
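
The recursive reversal mentioned in this footnote could be sketched as follows. This is only one possible reading, since we did not need it in practice: the function name `split_unknown`, the `merge_parents` table (merged symbol → the pair it was built from) and the intermediate symbols in the example are all hypothetical.

```python
def split_unknown(symbol, known, merge_parents):
    """Undo merges recursively until every resulting symbol is known."""
    # Stop at known symbols, and at symbols no merge produced
    # (e.g. single characters), which cannot be split further.
    if symbol in known or symbol not in merge_parents:
        return [symbol]
    left, right = merge_parents[symbol]
    return (split_unknown(left, known, merge_parents)
            + split_unknown(right, known, merge_parents))

# Hypothetical merge history: 'safe' + 'gu' -> 'safegu' -> 'safeguar'.
merge_parents = {"safeguar": ("safegu", "ar"), "safegu": ("safe", "gu")}
known = {"safe", "gu", "ar", "safeguard"}
print(split_unknown("safeguar", known, merge_parents))
```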


We evaluate two methods of applying BPE: learning two independent
encodings, one for the source, one for the target vocabulary, or learning
the encoding on the union of the two vocabularies (which we call joint
BPE).⁴ The former has the advantage of being more compact in terms of
text and vocabulary size, and having stronger guarantees that each
subword unit has been seen in the training text of the respective
language, whereas the latter improves consistency between the source and
the target segmentation. If we apply BPE independently, the same name may
be segmented differently in the two languages, which makes it harder for
the neural models to learn a mapping between the subword units. To
increase the consistency between English and Russian segmentation despite
the differing alphabets, we transliterate the Russian vocabulary into
Latin characters with ISO-9 to learn the joint BPE encoding, then
transliterate the BPE merge operations back into Cyrillic to apply them
to the Russian training text.⁵
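
A minimal sketch of how a joint dictionary might be built by concatenating both sides of the training set is given below. The helper name `build_vocab` and the two toy sentences are hypothetical; the output uses the same format as `vocab` in Algorithm 1 (space-separated symbols plus `</w>`), so it could be passed to `get_stats` and `merge_vocab` unchanged. Transliteration for differing alphabets is not covered here.

```python
from collections import Counter

def build_vocab(sentences, end_marker="</w>"):
    """Map each word type to its frequency, spelled as separate symbols."""
    counts = Counter(word for s in sentences for word in s.split())
    return {" ".join(word) + " " + end_marker: freq
            for word, freq in counts.items()}

# Toy parallel data (made up for illustration).
source = ["barack obama visited berlin"]
target = ["barack obama besuchte berlin"]

# Joint BPE: learn on the concatenation of both sides.
joint_vocab = build_vocab(source + target)
print(joint_vocab["b a r a c k </w>"])  # 2
```

Since a name such as "barack" contributes to the same pair counts on both sides, the learned merges segment it identically in source and target.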

4 Evaluation 

We aim to answer the following empirical questions:

• Can we improve the translation of rare and unseen words in neural
machine translation by representing them via subword units?

• Which segmentation into subword units performs best in terms of
vocabulary size, text size, and translation quality?

We perform experiments on data from the shared translation task of WMT
2015. For English→German, our training set consists of 4.2 million
sentence pairs, or approximately 100 million tokens. For English→Russian,
the training set consists of 2.6 million sentence pairs, or approximately
50 million tokens. We tokenize and truecase the data with the scripts
provided in Moses (Koehn et al., 2007). We use newstest2013 as
development set, and report results on newstest2014 and newstest2015.

We report results with BLEU (mteval-v13a.pl), and chrF3 (Popović, 2015),
a character n-gram F3 score which was found to correlate well with human
judgments, especially for translations out of English (Stanojević et al.,
2015). Since our main claim is concerned with the translation of rare and
unseen words, we report separate statistics for these. We measure these
through unigram F1, which we calculate as the harmonic mean of clipped
unigram precision and recall.⁶

⁴ In practice, we simply concatenate the source and target side of the
training set to learn joint BPE.

⁵ Since the Russian training text also contains words that use the Latin
alphabet, we also apply the Latin BPE operations.
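
The unigram F1 measure can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it scores one tokenized hypothesis against one reference, whereas the reported figures are computed over the test set and restricted to word classes (all, rare, OOV); the function name `clipped_unigram_f1` and the toy sentences are hypothetical.

```python
from collections import Counter

def clipped_unigram_f1(hypothesis, reference):
    """Harmonic mean of clipped unigram precision and recall."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Clipping: a word is credited at most as often as it occurs
    # in the reference (Counter intersection takes the minimum).
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

hyp = "the the cat".split()
ref = "the cat sat on".split()
# overlap = 2, precision = 2/3, recall = 2/4, so F1 = 4/7
print(clipped_unigram_f1(hyp, ref))
```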

We perform all experiments with Groundhog⁷ (Bahdanau et al., 2015). We
generally follow settings by previous work (Bahdanau et al., 2015; Jean
et al., 2015). All networks have a hidden layer size of 1000, and an
embedding layer size of 620. Following Jean et al. (2015), we only keep a
shortlist of τ = 30000 words in memory.

During training, we use Adadelta (Zeiler, 2012), a minibatch size of 80,
and reshuffle the training set between epochs. We train a network for
approximately 7 days, then take the last 4 saved models (models being
saved every 12 hours), and continue training each with a fixed embedding
layer (as suggested by Jean et al. (2015)) for 12 hours. We perform two
independent training runs for each model, once with a cut-off for
gradient clipping (Pascanu et al., 2013) of 5.0, once with a cut-off of
1.0 - the latter produced better single models for most settings. We
report results of the system that performed best on our development set
(newstest2013), and of an ensemble of all 8 models.

We use a beam size of 12 for beam search, with probabilities normalized
by sentence length. We use a bilingual dictionary based on fast-align
(Dyer et al., 2013). For our baseline, this serves as back-off dictionary
for rare words. We also use the dictionary to speed up translation for
all experiments, only performing the softmax over a filtered list of
candidate translations (like Jean et al. (2015), we use K = 30000;
K' = 10).
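
To illustrate why length normalization matters in beam search, here is a minimal sketch. It assumes that normalization means dividing the summed token log-probability by the number of target tokens; the exact formula used in our toolkit is not spelled out above, and the function name and the toy scores are hypothetical.

```python
def length_normalized_score(token_log_probs):
    # Average log-probability per target token (assumed normalization).
    return sum(token_log_probs) / len(token_log_probs)

# Two toy hypotheses: the longer one has a lower total log-probability
# but a better score per token.
short = [-0.9, -0.9]
long = [-0.5, -0.5, -0.5, -0.5]

best_raw = max([short, long], key=sum)
best_normalized = max([short, long], key=length_normalized_score)
print(best_raw is short, best_normalized is long)
```

This is relevant for subword models, since segmenting rare words lengthens the target sequence and would otherwise be penalized by the unnormalized score.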

4.1 Subword statistics 

Apart from translation quality, which we will verify empirically, our
main objective is to represent an open vocabulary through a compact
fixed-size subword vocabulary, and allow for efficient training and
decoding.⁸

Statistics for different segmentations of the German side of the parallel
data are shown in Table 1. A simple baseline is the segmentation of words
into character n-grams.⁹ Character n-grams allow for different trade-offs
between sequence length (# tokens) and vocabulary size (# types),
depending on the choice of n. The increase in sequence length is
substantial; one way to reduce sequence length is to leave a shortlist of
the k most frequent word types unsegmented. Only the unigram
representation is truly open-vocabulary. However, the unigram
representation performed poorly in preliminary experiments, and we report
translation results with a bigram representation, which is empirically
better, but unable to produce some tokens in the test set with the
training set vocabulary.

⁶ Clipped unigram precision is essentially 1-gram BLEU without brevity
penalty.

⁷ github.com/sebastien-j/LV_groundhog

⁸ The time complexity of encoder-decoder architectures is at least linear
to sequence length, and oversplitting harms efficiency.
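
The character n-gram baseline with a shortlist can be sketched as follows. The details are assumptions made for illustration: the function name `char_ngrams` is hypothetical, words are cut left to right into non-overlapping pieces of n characters, and `</w>` is used as the word-final marker.

```python
def char_ngrams(word, n=2, shortlist=frozenset(), end_marker="</w>"):
    """Split a word into non-overlapping character n-grams.

    Words on the shortlist stay unsegmented. The last unit carries an
    end-of-word marker so the tokenization can be restored.
    """
    if not word:
        return []
    if word in shortlist:
        return [word + end_marker]
    pieces = [word[i:i + n] for i in range(0, len(word), n)]
    pieces[-1] += end_marker
    return pieces

print(char_ngrams("lower", n=2))                     # ['lo', 'we', 'r</w>']
print(char_ngrams("the", n=2, shortlist={"the"}))    # ['the</w>']
```

A larger n shortens the sequences but enlarges the symbol vocabulary, and any n-gram unseen in training is unknown at test time, which is why only n = 1 is truly open-vocabulary.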

We report statistics for several word segmentation techniques that have
proven useful in previous SMT research, including frequency-based
compound splitting (Koehn and Knight, 2003), rule-based hyphenation
(Liang, 1983), and Morfessor (Creutz and Lagus, 2002). We find that they
only moderately reduce vocabulary size, and do not solve the unknown word
problem, and we thus find them unsuitable for our goal of open-vocabulary
translation without back-off dictionary.

BPE meets our goal of being open-vocabulary, and the learned merge
operations can be applied to the test set to obtain a segmentation with
no unknown symbols.¹⁰ Its main difference from the character-level model
is that the more compact representation of BPE allows for shorter
sequences, and that the attention model operates on variable-length
units.¹¹ Table 1 shows BPE with 59 500 merge operations, and joint BPE
with 89 500 operations.

In practice, we did not include infrequent subword units in the NMT
network vocabulary, since there is noise in the subword symbol sets, e.g.
because of characters from foreign alphabets. Hence, our network
vocabularies in Table 2 are typically slightly smaller than the number of
types in Table 1.


⁹ Our character n-grams do not cross word boundaries. We mark whether a
subword is word-final or not with a special character, which allows us to
restore the original tokenization.

¹⁰ Joint BPE can produce segments that are unknown because they only
occur in the English training text, but these are rare (0.05% of test
tokens).

¹¹ We highlighted the limitations of word-level attention in section 3.1.
At the other end of the spectrum, the character level is suboptimal for
alignment (Tiedemann, 2009).






                                    vocabulary        BLEU            chrF3           unigram F1 (%)
name       segmentation  shortlist  source   target   single  ens-8   single  ens-8   all    rare   OOV
syntax-based (Sennrich and Haddow, 2015)              24.4    -       55.3    -       59.1   46.0   37.7
WUnk       -             -          300000   500000   20.6    22.8    47.2    48.9    56.7   20.4   0.0
WDict      -             -          300000   500000   22.0    24.2    50.5    52.4    58.1   36.8   36.8
C2-50k     char-bigram   50000      60000    60000    22.8    25.3    51.9    53.5    58.4   40.5   30.9
BPE-60k    BPE           -          60000    60000    21.5    24.5    52.0    53.9    58.4   40.9   29.3
BPE-J90k   BPE (joint)   -          90000    90000    22.8    24.7    51.7    54.1    58.5   41.8   33.6

Table 2: English→German translation performance (BLEU, chrF3 and unigram
F1) on newstest2015. Ens-8: ensemble of 8 models. Best NMT system in
bold. Unigram F1 (with ensembles) is computed for all words (n = 44085),
rare words (not among top 50000 in training set; n = 2900), and OOVs (not
in training set; n = 1168).


segmentation                            # tokens    # types      # UNK
none                                    100 m       1 750 000    1079
characters                              550 m       3000         0
character bigrams                       306 m       20 000       34
character trigrams                      214 m       120 000      59
compound splitting△                     102 m       1 100 000    643
morfessor*                              109 m       544 000      237
hyphenation⋄                            186 m       404 000      230
BPE                                     112 m       63 000       0
BPE (joint)                             111 m       82 000       32
character bigrams (shortlist: 50000)    129 m       69 000       34

Table 1: Corpus statistics for German training corpus with different word
segmentation techniques. #UNK: number of unknown tokens in newstest2013.
△: (Koehn and Knight, 2003); *: (Creutz and Lagus, 2002); ⋄: (Liang,
1983).

4.2 Translation experiments 

English→German translation results are shown in Table 2; English→Russian
results in Table 3.

Our baseline WDict is a word-level model with a back-off dictionary. It
differs from WUnk in that the latter uses no back-off dictionary, and
just represents out-of-vocabulary words as UNK.¹² The back-off dictionary
improves unigram F1 for rare and unseen words, although the improvement
is smaller for English→Russian, since the back-off dictionary is
incapable of transliterating names.

All subword systems operate without a back-off dictionary. We first focus
on unigram F1, where all systems improve over the baseline, especially
for rare words (36.8%→41.8% for EN→DE; 26.5%→29.7% for EN→RU). For OOVs,
the baseline strategy of copying unknown words works well for
English→German. However, when alphabets differ, like in English→Russian,
the subword models do much better.

¹² We use UNK for words that are outside the model vocabulary, and OOV
for those that do not occur in the training text.


Unigram F1 scores indicate that learning the BPE symbols on the
vocabulary union (BPE-J90k) is more effective than learning them
separately (BPE-60k), and more effective than using character bigrams
with a shortlist of 50 000 unsegmented words (C2-50k), but all reported
subword segmentations are viable choices and outperform the back-off
dictionary baseline.

Our subword representations cause big improvements in the translation of
rare and unseen words, but these only constitute 9-11% of the test sets.
Since rare words tend to carry central information in a sentence, we
suspect that BLEU and chrF3 underestimate their effect on translation
quality. Still, we also see improvements over the baseline in total
unigram F1, as well as BLEU and chrF3, and the subword ensembles
outperform the WDict baseline by 0.3-1.3 BLEU and 0.6-2 chrF3. There is
some inconsistency between BLEU and chrF3, which we attribute to the fact
that BLEU has a precision bias, and chrF3 a recall bias.

For English→German, we observe the best BLEU score of 25.3 with C2-50k,
but the best chrF3 score of 54.1 with BPE-J90k. For comparison to the (to
our knowledge) best non-neural MT system on this data set, we report
syntax-based SMT results (Sennrich and Haddow, 2015). We observe that our
best systems outperform the syntax-based system in terms of BLEU, but not
in terms of chrF3. Regarding other neural systems, Luong et al. (2015a)
report a BLEU score of 25.9 on newstest2015, but we note that they use an
ensemble of 8 independently trained models, and also report strong
improvements from applying dropout, which we did not use. We are
confident that our improvements to the translation of rare words are
orthogonal to improvements achievable through other improvements in the
network architecture, training algorithm, or better ensembles.

For English→Russian, the state of the art is the phrase-based system by
Haddow et al. (2015). It outperforms our WDict baseline by 1.5 BLEU. The
subword models are a step towards closing this gap, and BPE-J90k yields
an improvement of 1.3 BLEU, and 2.0 chrF3, over WDict.

As a further comment on our translation results, we want to emphasize
that performance variability is still an open problem with NMT. On our
development set, we observe differences of up to 1 BLEU between different
models. For single systems, we report the results of the model that
performs best on dev (out of 8), which has a stabilizing effect, but how
to control for randomness deserves further attention in future research.

5 Analysis 

5.1 Unigram accuracy

Our main claims are that the translation of rare and unknown words is
poor in word-level NMT models, and that subword models improve the
translation of these word types. To further illustrate the effect of
different subword segmentations on the translation of rare and unseen
words, we plot target-side words sorted by their frequency in the
training set.¹³ To analyze the effect of vocabulary size, we also include
the system C2-3/500k, which is a system with the same vocabulary size as
the WDict baseline, and character bigrams to represent unseen words.

Figure 2 shows results for the English-German ensemble systems on
newstest2015. Unigram F1 of all systems tends to decrease for
lower-frequency words. The baseline system has a spike in F1 for OOVs,
i.e. words that do not occur in the training text. This is because a high
proportion of OOVs are names, for which a copy from the source to the
target text is a good strategy for English→German.

The systems with a target vocabulary of 500 000 words mostly differ in
how well they translate words with rank > 500 000. A back-off dictionary
is an obvious improvement over producing UNK, but the subword system
C2-3/500k achieves better performance. Note that all OOVs that the
back-off dictionary produces are words that are copied from the source,
usually names, while the subword systems can productively form new words
such as compounds.

¹³ We perform binning of words with the same training set frequency, and
apply bezier smoothing to the graph.

For the 50 000 most frequent words, the representation is the same for
all neural networks, and all neural networks achieve comparable unigram
F1 for this category. For the interval between frequency rank 50 000 and
500 000, the comparison between C2-3/500k and C2-50k unveils an
interesting difference. The two systems only differ in the size of the
shortlist, with C2-3/500k representing words in this interval as single
units, and C2-50k via subword units. We find that the performance of
C2-3/500k degrades heavily up to frequency rank 500 000, at which point
the model switches to a subword representation and performance recovers.
The performance of C2-50k remains more stable. We attribute this to the
fact that subword units are less sparse than words. In our training set,
the frequency rank 50 000 corresponds to a frequency of 60 in the
training data; the frequency rank 500 000 to a frequency of 2. Because
subword representations are less sparse, reducing the size of the network
vocabulary, and representing more words via subword units, can lead to
better performance.

The F1 numbers hide some qualitative differences between systems. For
English→German, WDict produces few OOVs (26.5% recall), but with high
precision (60.6%), whereas the subword systems achieve higher recall, but
lower precision. We note that the character bigram model C2-50k produces
the most OOV words, and achieves relatively low precision of 29.1% for
this category. However, it outperforms the back-off dictionary in recall
(33.0%). BPE-60k, which suffers from transliteration (or copy) errors due
to segmentation inconsistencies, obtains a slightly better precision
(32.4%), but a worse recall (26.6%). In contrast to BPE-60k, the joint
BPE encoding of BPE-J90k improves both precision (38.6%) and recall
(29.8%).
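
As a quick consistency check, the OOV precision and recall figures quoted above can be combined into F1 with the harmonic mean. The short sketch below (the helper name `f1` is ours) recomputes the values; they agree with the OOV column of Table 2 to within about 0.1, which is consistent with rounding of the reported precision and recall.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall (both in percent).
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for English→German OOVs, as quoted in the text.
reported = {
    "WDict": (60.6, 26.5),
    "C2-50k": (29.1, 33.0),
    "BPE-60k": (32.4, 26.6),
    "BPE-J90k": (38.6, 29.8),
}
for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")
```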

For English→Russian, unknown names can only rarely be copied, and usually
require transliteration. Consequently, the WDict baseline performs more
poorly for OOVs (9.2% precision; 5.2% recall), and the subword models
improve both precision and recall (21.9% precision and 15.6% recall for
BPE-J90k). The full unigram F1 plot is shown in Figure 3.






vocabulary 

Bleu 

chrF3 

unigram Fi (%) 

name 

segmentation 

shortlist 

source 

target 

single 

ens-8 

single 

ens-8 

all 

rare 

OOV 

phrase-based (Haddow et al., 2015) 

24.3 

- 

53.8 

- 

56.0 

31.3 

16.5 

WUnk 

- 

- 

300000 

500000 

18.8 

22.4 

46.5 

49.9 

54.2 

25.2 

0.0 

WDict 

- 

- 

300000 

500000 

19.1 

22.8 

47.5 

51.0 

54.8 

26.5 

6.6 

C2-50k 

char-bigram 

50000 

60000 

60000 

20.9 

24.1 

49.0 

51.6 

55.2 

27.8 

17.4 

BPE-60k 

BPE 

- 

60000 

60000 

20.5 

23.6 

49.8 

52.7 

55.3 

29.7 

15.6 

BPE-J90k 

BPE (joint) 

- 

90000 

100000 

20.4 

24.1 

49.7 

53.0 

55.8 

29.7 

18.3 


Table 3: English→Russian translation performance (Bleu, chrF3 and unigram F1) on newstest2015.
Ens-8: ensemble of 8 models. Best NMT system in bold. Unigram F1 (with ensembles) is computed for
all words (n = 55654), rare words (not among top 50000 in training set; n = 5442), and OOVs (not in
training set; n = 851).
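
The word categories named in the caption can be sketched as follows. This is an illustration under stated assumptions, not the script used for the table: the helper names `build_rank` and `word_category` are hypothetical, ties in frequency are broken by first occurrence, and OOV words are treated as a category separate from rare words (the caption leaves open whether the rare set also contains OOVs).

```python
from collections import Counter


def build_rank(train_tokens):
    """Map each training word to its frequency rank (1 = most frequent)."""
    counts = Counter(train_tokens)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def word_category(word, rank, rare_threshold=50000):
    """Classify a word by its training-set frequency rank.

    rare_threshold=50000 follows the caption ("not among top 50000").
    """
    if word not in rank:
        return "OOV"          # never seen in the training set
    if rank[word] > rare_threshold:
        return "rare"         # seen, but outside the most frequent words
    return "frequent"


# Toy usage with a tiny corpus and a threshold of 2 instead of 50000.
rank = build_rank("a a a b b c".split())
categories = [word_category(w, rank, rare_threshold=2) for w in ["a", "c", "d"]]
```

The same rank mapping could also serve as the x-axis when plotting F1 by training set frequency rank.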



Figure 2: English→German unigram F1 on
newstest2015 plotted by training set frequency rank
for different NMT systems.



Figure 3: English→Russian unigram F1 on
newstest2015 plotted by training set frequency rank
for different NMT systems.


5.2 Manual Analysis 

Table 4 shows two translation examples for the
translation direction English→German, Table 5 for
English→Russian. The baseline system fails for all
of the examples, either by deleting content
(health), or by copying source words that should be
translated or transliterated. The subword
translations of health research institutes show that
the subword systems are capable of learning
translations when oversplitting
(research→Fo|rs|ch|un|g), or when the segmentation
does not match morpheme boundaries: the segmentation
Forschungs|instituten would be linguistically more
plausible, and simpler to align to the English
research institutes, than the segmentation
Forsch|ungsinstitu|ten in the BPE-60k system, but
still, a correct translation is produced. If the
systems have failed to learn a translation due to
data sparseness, like for asinine, which should be
translated as dumm, we see translations that are
wrong, but could be plausible for (partial)
loanwords (asinine situation→Asinin-Situation).

The English→Russian examples show that the subword
systems are capable of transliteration. However,
transliteration errors do occur, either due to
ambiguous transliterations, or because of
non-consistent segmentations between source and
target text which make it hard for the system to
learn a transliteration mapping. Note that the
BPE-60k system encodes Mirzayeva inconsistently for
the two language pairs (Mirz|ayeva→Мир|за|ева
Mir|za|eva). This example is still translated
correctly, but we observe spurious insertions and
deletions of characters in the BPE-60k system. An
example is the transliteration of rakfisk, where a п
is inserted and a к is deleted. We trace this error
back to translation pairs in the training data with
inconsistent segmentations, such as
(p|rak|ri|ti→пра|крит|и

























system     sentence
source     health research institutes
reference  Gesundheitsforschungsinstitute
WDict      Forschungsinstitute
C2-50k     Fo|rs|ch|un|gs|in|st|it|ut|io|ne|n
BPE-60k    Gesundheits|forsch|ungsinstitu|ten
BPE-J90k   Gesundheits|forsch|ungsin|stitute

source     asinine situation
reference  dumme Situation
WDict      asinine situation → UNK → asinine
C2-50k     as|in|in|e situation → As|in|en|si|tu|at|io|n
BPE-60k    as|in|ine situation → A|in|line-|Situation
BPE-J90k   as|in|ine situation → As|in|in-|Situation


Table 4: English→German translation examples.
“|” marks subword boundaries.


system     sentence
source     Mirzayeva
reference  Мирзаева (Mirzaeva)
WDict      Mirzayeva → UNK → Mirzayeva
C2-50k     Mi|rz|ay|ev|a → Ми|рз|ае|ва (Mi|rz|ae|va)
BPE-60k    Mirz|ayeva → Мир|за|ева (Mir|za|eva)
BPE-J90k   Mir|za|yeva → Мир|за|ева (Mir|za|eva)

source     rakfisk
reference  ракфиска (rakfiska)
WDict      rakfisk → UNK → rakfisk
C2-50k     ra|kf|is|k → ра|кф|ис|к (ra|kf|is|k)
BPE-60k    rak|f|isk → пра|ф|иск (pra|f|isk)
BPE-J90k   rak|f|isk → рак|ф|иска (rak|f|iska)


Table 5: English→Russian translation examples.
“|” marks subword boundaries.


(pra|krit|i)), from which the translation (rak→пра)
is erroneously learned. The segmentation of the
joint BPE system (BPE-J90k) is more consistent
(pra|krit|i→пра|крит|и (pra|krit|i)).

6 Conclusion 

The main contribution of this paper is that we show
that neural machine translation systems are capable
of open-vocabulary translation by representing rare
and unseen words as a sequence of subword units.
This is both simpler and more effective than using a
back-off translation model. We introduce a variant
of byte pair encoding for word segmentation, which
is capable of encoding open vocabularies with a
compact symbol vocabulary of variable-length subword
units. We show performance gains over the baseline
with both BPE segmentation, and a simple character
bigram segmentation.
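
To make the idea of byte pair encoding for word segmentation concrete, the following is a simplified sketch, not the released implementation referenced in the footnote. It assumes a word-frequency dictionary as input and an end-of-word marker `</w>`; the function names `merge_pair`, `learn_bpe` and `segment` are hypothetical. Ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice of this sketch.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge operations from word frequencies."""
    # Start from characters plus an end-of-word marker (assumption: "</w>").
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy usage: the unseen word "lowest" is composed from learned subword units.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5)
units = segment("lowest", merges)
```

Because the number of merge operations is the only parameter, it directly controls the size of the symbol vocabulary, and any word remains representable as a sequence of units even if no merge applies to it.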

Our analysis shows that not only out-of-vocabulary
words, but also rare in-vocabulary words are
translated poorly by our baseline NMT

The source code of the segmentation algorithms
is available at
https://github.com/rsennrich/subword-nmt.


system, and that reducing the vocabulary size of
subword models can actually improve performance. In
this work, our choice of vocabulary size is somewhat
arbitrary, and mainly motivated by comparison to
prior work. One avenue of future research is to
automatically learn the optimal vocabulary size for
a translation task, which we expect to depend on the
language pair and amount of training data. We also
believe there is further potential in bilingually
informed segmentation algorithms to create more
alignable subword units, although the segmentation
algorithm cannot rely on the target text at runtime.

While the relative effectiveness will depend on
language-specific factors such as vocabulary size,
we believe that subword segmentations are suitable
for most language pairs, eliminating the need for
large NMT vocabularies or back-off models.